Fingerprinting multimedia contents

ABSTRACT

Disclosed is a method and arrangement for extracting a fingerprint from a multimedia signal, particularly an audio signal, which is invariant to speed changes of the audio signal. To this end, the method comprises extracting (12, 13) a set of robust perceptual features from the multimedia signal, for example, the power spectrum of the audio signal. A Fourier-Mellin transform (15) converts the power spectrum into Fourier coefficients that undergo a phase change only if the audio playback speed changes. Their magnitudes or phase differences (16) constitute a speed change-invariant fingerprint. By a thresholding operation (19), the fingerprint can be represented by a compact number of bits.

FIELD OF THE INVENTION

The invention relates to a method and arrangement for extracting a fingerprint from a multimedia signal.

BACKGROUND OF THE INVENTION

Fingerprints, in the literature sometimes referred to as hashes or signatures, are binary sequences extracted from multimedia contents, which can be used to identify said contents. Unlike cryptographic hashes of data files (which change as soon as a single bit of the data file changes), fingerprints of multimedia contents (audio, images, video) are to a certain extent invariant to processing such as compression and D/A and A/D conversion. This is generally achieved by extracting the fingerprint from perceptually essential features of the contents.

A prior-art method of extracting a fingerprint from a multimedia signal is disclosed in International Patent Application WO 02/065782. The method comprises the steps of extracting a set of robust perceptual features from the multimedia signal, and converting the set of features into the fingerprint. For audio signals, the perceptual features are energies of the audio contents in selected sub-bands. For image signals, the perceptual features are average luminances of blocks into which the image is divided. The conversion into a binary sequence is performed by thresholding, for example, by comparing each feature sample with its neighbors.

An attractive application of fingerprinting is content identification. The artist and title of a song or video clip can be identified by extracting a fingerprint from an excerpt of the unknown material and sending it to a large database of fingerprints in which said information is stored.

Experiments have shown that the prior-art method of extracting fingerprints from an audio signal is very robust against almost all commonly used audio processing operations, such as MP3 compression and decompression, equalization, re-sampling, noise addition, and D/A and A/D conversion.

It is quite common for radio stations to speed up audio by a few percent. They supposedly do this for two reasons. First, the songs are then shorter, which enables the stations to broadcast more commercials. Secondly, the beat of the song is faster, and the audience seems to prefer this. The speed changes typically lie between zero and four percent.

Speed changes of audio material cause misalignment in both the temporal and the frequency domain. The prior-art fingerprint extraction method does not suffer from misalignment in the temporal domain, because the fingerprint is a concatenation of small sub-fingerprints extracted from overlapping audio frames. A speed change of, say, 2% merely causes the 250th sub-fingerprint of an excerpt to be extracted at the position of the 255th sub-fingerprint of the corresponding original excerpt.

Misalignment in the frequency domain is caused by spectral energies shifting to other frequencies. The above example of a 2% speedup causes all audio frequencies to increase by 2%. In the prior-art audio fingerprint extraction method, this causes the energies in the selected sub-bands (and thus the fingerprint) to change. As a result, the fingerprints can no longer be found in a database, unless a plurality of fingerprints corresponding to different speed versions is stored in the database for each song.

Similar considerations apply to image and video material and to other kinds of perceptual features being used for fingerprint extraction.

OBJECT AND SUMMARY OF THE INVENTION

It is an object of the invention to provide an improved method and arrangement for extracting a fingerprint from multimedia contents. It is a particular object of the invention to provide a method and arrangement for extracting a fingerprint from an audio signal that is substantially invariant to speed changes of the audio signal.

To this end, the method of extracting a fingerprint from a multimedia signal according to the invention comprises the steps of: extracting a set of robust perceptual features from the multimedia signal; subjecting the extracted set of features to a Fourier-Mellin transform; and converting the transformed set of features into a sequence constituting the fingerprint.

The invention exploits the insight that the Fourier-Mellin transform consists of a log mapping and a Fourier transform. The log mapping converts the scaling of the energy spectrum due to a speed change into a shift. The subsequent Fourier transform converts the shift into a phase change which is the same for all Fourier coefficients. Magnitudes of the Fourier coefficients are not affected by the speed change. A fingerprint derived from the magnitude or from the derivative of the phase of the Fourier coefficients is thus invariant to speed changes.
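The magnitude invariance can be made explicit with a short derivation. This is a sketch for an idealized continuous spectrum; the symbols s (speed factor), λ (log-frequency), B and F are introduced here for illustration only, and any constant gain factor accompanying the scaling is ignored. Assume that a speed change by a factor s turns the power spectrum A(ω) into A_s(ω) = A(ω/s), and define the log-mapped spectrum as B(λ) = A(e^λ). Then:

$$B_s(\lambda) = A\left(e^{\lambda}/s\right) = A\left(e^{\lambda - \ln s}\right) = B(\lambda - \ln s)$$

$$F_s(u) = \int_{-\infty}^{\infty} B(\lambda - \ln s)\, e^{-ju\lambda}\, d\lambda = e^{-ju \ln s}\, F(u)$$

$$\left|F_s(u)\right| = \left|F(u)\right|$$

The scaling thus appears only as a unit-modulus phase factor, so the magnitudes of the Fourier coefficients are unchanged under these assumptions.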

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows schematically an arrangement for extracting a fingerprint from a multimedia signal or, equivalently, the corresponding steps of a method of extracting such a fingerprint according to the invention.

FIGS. 2 and 3 show diagrams to illustrate the operation of a log mapping circuit, which is shown in FIG. 1.

DESCRIPTION OF EMBODIMENTS

The invention will be described with reference to an arrangement for extracting a fingerprint from an audio signal. FIG. 1 shows schematically such an arrangement according to the invention.

The arrangement comprises a framing circuit 11, which divides the audio signal into overlapping frames of approx. 0.4 seconds with an overlap factor of 31/32. The overlap is to be chosen such that a high correlation between sub-fingerprints of subsequent frames is obtained. Prior to the division into frames, the audio signal has been limited to a frequency range of approx. 300 Hz-3 kHz and down-sampled (not shown), so that each frame comprises 2048 samples.
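The framing step can be sketched as follows. This is a minimal illustration, not the circuit of FIG. 1: the function name `split_into_frames` and the dummy input are assumptions, and band-limiting and down-sampling are taken to have been done already. With 2048 samples per frame and an overlap factor of 31/32, successive frames start 64 samples apart.

```python
def split_into_frames(samples, frame_len=2048, overlap=31 / 32):
    """Split a sample sequence into overlapping frames of frame_len samples."""
    # Hop size between frame starts: 2048 * (1 - 31/32) = 64 samples.
    hop = round(frame_len * (1 - overlap))
    if hop < 1:
        raise ValueError("overlap too large for this frame length")
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop)]


# Dummy input (assumption): one full frame plus three extra hops of samples.
signal = list(range(2048 + 3 * 64))
frames = split_into_frames(signal)
print(len(frames), frames[1][0])
```

Because consecutive frames share 31/32 of their samples, the features derived from them change slowly from frame to frame, which is the high correlation between sub-fingerprints that the text asks for.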

A Fourier transform circuit 12 computes the spectral representation of every frame. In the next block 13, the power spectrum of the audio frame is computed, for example, by squaring the magnitudes of the (complex) Fourier coefficients. For each frame of 2048 audio signal samples, the power spectrum is represented by 1024 samples (positive and corresponding negative frequencies have the same magnitudes). The samples of the power spectrum constitute a set of robust perceptual features. The spectrum is not substantially affected by operations such as D/A and A/D conversion or MP3 compression.
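As an illustration of blocks 12 and 13, the sketch below computes squared DFT magnitudes for the non-negative-frequency half of a real frame. It uses a direct O(n²) DFT from the standard library and a short 64-sample test frame to stay fast; a practical system would use an FFT on 2048-sample frames. The function name `power_spectrum` and the test tone are assumptions made for the example.

```python
import cmath
import math


def power_spectrum(frame):
    """Return squared DFT magnitudes for bins 0 .. n/2 - 1 of a real frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Test frame (assumption): a pure cosine that falls exactly on bin 5.
n = 64
frame = [math.cos(2 * math.pi * 5 * t / n) for t in range(n)]
spec = power_spectrum(frame)
peak = max(range(len(spec)), key=spec.__getitem__)
print(len(spec), peak)
```

A frame of n real samples yields n/2 power samples here, matching the 2048-to-1024 reduction described above.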

After calculating the power spectrum, an optional normalization circuit 14 applies local normalization to the power spectrum. Such a normalization (which includes de-convolution and filtering) improves the performance as it obtains a more decisive and robust representation of the power spectrum. Local normalization preserves the important characteristics of the spectrum and is robust against all kinds of audio processing including local modifications of the audio spectrum, such as equalization. The most promising approach is to emphasize the tonal part of the spectrum by normalizing it with its local mean. Mathematically, the normalized spectrum N(ω) is obtained by dividing the spectrum A(ω) by its local mean Lm(ω) as follows:

$$N(\omega) = \frac{A(\omega)}{Lm(\omega)}$$

The local mean can be calculated in various ways, for example:

$$Lm(\omega) = \frac{1}{2\delta}\int_{\omega-\delta}^{\omega+\delta} A(\tau)\,d\tau \quad \text{(arithmetic mean)},$$

or

$$Lm(\omega) = \exp\left[\frac{1}{2\delta}\int_{\omega-\delta}^{\omega+\delta} \log A(\tau)\,d\tau\right] \quad \text{(geometric mean)},$$

and so on. The normalized spectrum remains invariant to equalization. Moreover, tonal information is directly related to human hearing and is well preserved after most audio processing. The importance of tonal information is widely accepted and has been utilized in audio recognition and in bit allocation for audio compression. Although local normalization has many advantages, the normalization is not consistent after compression if there are no tonal components between ω−δ and ω+δ. To mitigate this effect, integration over time and a total-energy term are added to Lm(ω). A modified local mean Lm′(ω) is then given as follows:

$$Lm'(\omega) = \frac{1}{2\delta}\int_{t-\Delta}^{t}\int_{\omega-\delta}^{\omega+\delta} A(\tau)\,d\tau + \alpha\int_{t-\Delta}^{t}\int_{-\infty}^{\infty} A(\tau)\,d\tau$$

where Δ and α are constants, which are determined experimentally. Integration over time makes the normalization more consistent, and the total-energy term limits the increase of small non-tonal components after normalization.
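A discrete version of the arithmetic-mean normalization can be sketched as follows. The function name `local_normalize`, the window half-width in samples (standing in for δ) and the small `eps` guard against division by zero are assumptions of this example; the time integration and total-energy term of Lm′(ω) are not modelled.

```python
def local_normalize(spectrum, half_width=2, eps=1e-12):
    """Divide each spectrum sample by the arithmetic mean of its neighborhood."""
    n = len(spectrum)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)          # window is clipped at the edges
        hi = min(n, i + half_width + 1)
        local_mean = sum(spectrum[lo:hi]) / (hi - lo)
        out.append(spectrum[i] / (local_mean + eps))  # eps: assumed guard
    return out


# Toy spectrum (assumption): a flat floor with one tonal peak at index 2.
spectrum = [1.0, 1.0, 8.0, 1.0, 1.0, 1.0, 1.0]
normalized = local_normalize(spectrum)
# The same spectrum after a gain that is constant across the window.
scaled = local_normalize([4.0 * v for v in spectrum])
print(normalized[2], scaled[2])
```

In this toy case a gain that is constant over the averaging window cancels in the ratio, which illustrates why the normalized spectrum is insensitive to equalization that varies slowly with frequency, while the tonal peak still stands out.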

The invention resides in the application of a Fourier-Mellin transform 15 to the power spectrum to achieve speed change resilience. The Fourier-Mellin transform consists of a log mapping process 151 and a Fourier transform (or inverse Fourier transform) 152.

FIGS. 2 and 3 show diagrams to illustrate the log mapping operation. In FIG. 2, reference numeral 21 denotes the samples of the power spectrum of an audio frame as supplied by the Fourier transform 12 in the case that the audio signal is being played back at normal speed. For the sake of convenience, a smooth power spectrum in the range 300-3,000 Hz is shown. In reality, the spectrum will generally exhibit a jagged outline. Reference numeral 22 in FIG. 2 denotes the power spectrum of the same audio frame in the case that the audio signal is being played back at an increased speed. As can be seen in the Figure, the speed change causes the power spectrum to be scaled.

FIG. 3 shows the corresponding power spectra as computed by the log mapping circuit 151. The power spectrum now represents the energy of the audio frame in a selected number of successive logarithmically spaced sub-bands. Reference numeral 31 denotes the log-mapped power spectrum for the audio signal being played back at normal speed. Reference numeral 32 denotes the log-mapped power spectrum for the audio signal being played back at the increased speed.

The process of log mapping can be carried out in several ways. In the embodiment shown in FIG. 3, the input power spectrum is interpolated and re-sampled at logarithmically spaced intervals. In another embodiment (not shown), the samples within logarithmically spaced (and sized) sub-bands of the input power spectrum are accumulated to provide respective samples of the log-mapped power spectrum.
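The interpolate-and-resample variant of the log mapping can be sketched as follows. Linear interpolation, the function name `log_map` and the assumption that the input samples lie on a uniform frequency grid from `f_lo` to `f_hi` are choices made for this example only. The test input is a ramp whose value equals its frequency, so the output directly shows the frequencies at which the spectrum is re-sampled.

```python
import math


def log_map(spectrum, f_lo, f_hi, num_out):
    """Resample a uniformly spaced spectrum at logarithmically spaced frequencies.

    spectrum[i] is assumed to lie at f_lo + i * (f_hi - f_lo) / (len(spectrum) - 1).
    """
    n = len(spectrum)
    step = (f_hi - f_lo) / (n - 1)
    ratio = f_hi / f_lo
    out = []
    for j in range(num_out):
        freq = f_lo * ratio ** (j / (num_out - 1))   # log-spaced target frequency
        pos = min((freq - f_lo) / step, n - 1)       # fractional input index
        i = min(int(pos), n - 2)
        frac = pos - i
        out.append((1 - frac) * spectrum[i] + frac * spectrum[i + 1])
    return out


# Test input (assumption): 1024 samples spanning 300-3000 Hz, value == frequency.
n_in = 1024
step = (3000.0 - 300.0) / (n_in - 1)
ramp = [300.0 + i * step for i in range(n_in)]
mapped = log_map(ramp, 300.0, 3000.0, 512)
print(mapped[0], mapped[-1], mapped[11] / mapped[10])
```

Successive output frequencies differ by a constant ratio rather than a constant step. Under the assumptions of this sketch (512 outputs over one decade), a 2% scaling of the frequency axis corresponds to a shift of about 511 · ln(1.02) / ln(10) ≈ 4.4 output samples.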

The number of samples representing the log-mapped power spectrum is chosen to be such that subsequent operations can be carried out with sufficient precision. In a practical embodiment, the log-mapped power spectrum is represented by 512 samples. It will be appreciated from inspection of FIG. 3 that the log-mapping operation translates the scaling (21→22) of the power spectrum due to the speed change into a shift (31→32). As long as the playback speed of the audio signal does not change within the frame period (which is a reasonable assumption in practice), the shift is the same for all coefficients.

The subsequent Fourier transform 152 translates said shift into a change of the phase of the complex Fourier coefficients. The phase change is the same for all coefficients. Thus, if the speed of the audio signal changes, the phases of all Fourier coefficients computed by Fourier transform circuit 152 change by an identical amount. In other words, the magnitudes of the coefficients as well as their phase differences are invariant to speed changes. They are calculated in a computing circuit 16. As the magnitudes and phase differences are the same for positive and negative frequencies, the number of unique values is 256.
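The magnitude invariance can be checked numerically. The sketch below is an idealization: it uses a toy log-mapped spectrum and a circular shift, for which the DFT magnitudes are exactly preserved. With a real speed change, energy moves in and out at the band edges, so the invariance holds only approximately. The helper `dft` and the toy data are assumptions of the example.

```python
import cmath
import math


def dft(values):
    """Direct discrete Fourier transform (O(n^2), standard library only)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(values))
            for k in range(n)]


# Toy log-mapped spectrum (assumption): a smooth bump centered on sample 20.
log_spectrum = [math.exp(-((i - 20) / 6.0) ** 2) for i in range(64)]

# Model the effect of a speed change as a circular shift by 3 samples.
shift = 3
shifted = log_spectrum[-shift:] + log_spectrum[:-shift]

mags = [abs(c) for c in dft(log_spectrum)]
mags_shifted = [abs(c) for c in dft(shifted)]
max_diff = max(abs(a - b) for a, b in zip(mags, mags_shifted))
print(max_diff)
```

The printed difference is at the level of floating-point rounding, whereas the two input sequences themselves differ clearly.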

The vector of 256 magnitudes or phase differences representing the log-mapped power spectrum of an audio frame is hereinafter denoted F(k,n), where k = 1, ..., 256 and n is the audio frame number. In fact, the vector constitutes a speed change-invariant fingerprint. However, the number of values is large, and each value requires a multi-bit representation in a digital fingerprinting system. The number of bits to represent the fingerprint can be reduced by selecting the lowest-order values only. This is performed by a selection circuit 17. It has been found that the 32 lowest-order values (the most significant coefficients) provide a sufficiently accurate representation of the log-mapped power spectrum.

The number of bits can be further reduced by subjecting the selected magnitudes or phase differences to a thresholding process. In a simple embodiment, a thresholding stage 19 generates one bit for each feature sample, for example, a ‘1’ if the value F(k,n) is above a threshold and a ‘0’ if it is below said threshold. Alternatively, a fingerprint bit is given the value ‘1’ if the corresponding feature sample F(k,n) is larger than its neighbor, otherwise it is ‘0’. To this end, the feature samples F(k,n) are first filtered in a one-dimensional temporal filter 18. The present embodiment uses an improved version of the latter alternative. In this preferred embodiment, a fingerprint bit ‘1’ is generated if the feature sample F(k,n) is larger than its neighbor and if this was also the case in the previous frame, otherwise the fingerprint bit is ‘0’. In this embodiment, the filter 18 is a two-dimensional filter. In mathematical notation:

$$FP(k,n) = \begin{cases} 1 & \text{if } F(k,n) - F(k+1,n) - \bigl(F(k,n-1) - F(k+1,n-1)\bigr) > 0 \\ 0 & \text{if } F(k,n) - F(k+1,n) - \bigl(F(k,n-1) - F(k+1,n-1)\bigr) \leq 0 \end{cases}$$

When thresholding is used, each sub-fingerprint being extracted from an audio frame has 32 bits.
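The bit derivation given by the formula for FP(k,n) can be sketched as follows. The function name `fingerprint_bits` is hypothetical. Because the formula refers to F(k+1,n), a vector of m feature values yields m − 1 bits in this sketch; how the last index is handled to arrive at exactly 32 bits is not specified in the text, so supplying 33 values per frame is an assumption.

```python
def fingerprint_bits(curr, prev):
    """Derive sub-fingerprint bits from feature vectors F(., n) and F(., n-1).

    Bit k is 1 if (F(k,n) - F(k+1,n)) - (F(k,n-1) - F(k+1,n-1)) > 0, else 0.
    """
    if len(curr) != len(prev):
        raise ValueError("feature vectors must have equal length")
    bits = []
    for k in range(len(curr) - 1):
        diff = (curr[k] - curr[k + 1]) - (prev[k] - prev[k + 1])
        bits.append(1 if diff > 0 else 0)
    return bits


# Toy feature vectors (assumption): four values per frame give three bits.
prev_frame = [2.0, 2.0, 2.0, 2.0]
curr_frame = [5.0, 3.0, 4.0, 1.0]
bits = fingerprint_bits(curr_frame, prev_frame)
print(bits)
```

Since only the sign of a difference of differences is kept, a constant offset added to all feature values of a frame does not affect the bits in this sketch.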

Although the invention has been described with reference to audio fingerprinting, it can also be applied to other multimedia signals such as images and motion video. While speed changes are often applied to audio signals, affine transformations such as shift, scaling and rotation are often applied to images and video. The method according to the invention can be used to improve robustness to such affine transformations. In the case of a two-dimensional signal, the log-mapping process 151 is changed into log-polar mapping to make it invariant against rotation as well as scaling (retaining aspect ratio). A log-log mapping makes it invariant to changes of the aspect ratio. The magnitude of the Fourier-Mellin transform (now a 2D transform) and double differentiation of its phase along the frequency axis have the desired affine invariant property.
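The effect of the log-polar mapping can be illustrated on a single coordinate pair. The sketch below only shows the coordinate transform, not a full image resampling, and the function name `log_polar` is an assumption. Scaling a point by a factor c adds ln c to the log-radius, and rotating it adds the rotation angle to the polar angle, so both operations become shifts that a subsequent 2D Fourier transform turns into phase changes.

```python
import math


def log_polar(x, y):
    """Map Cartesian coordinates (x, y) to (log radius, angle in radians)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


rho, theta = log_polar(3.0, 4.0)
rho_scaled, theta_scaled = log_polar(6.0, 8.0)    # same point scaled by 2
rho_rot, theta_rot = log_polar(-4.0, 3.0)         # same point rotated by 90 degrees

print(rho_scaled - rho, theta_scaled - theta)     # shift in log-radius only
print(rho_rot - rho, theta_rot - theta)           # shift in angle only
```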

1. A method of extracting a fingerprint from a multimedia signal, comprising the steps of: extracting (12,13) a set of robust perceptual features from the multimedia signal; subjecting (15) the extracted set of features to a Fourier-Mellin transform; converting (16,19) the transformed set of features into a sequence constituting the fingerprint.
2. A method as claimed in claim 1, wherein said converting step includes converting (16,ABS) the magnitudes of the Fourier-Mellin transform.
3. A method as claimed in claim 1, wherein said converting step includes converting (16,Δφ) the derivative of the phase of the Fourier-Mellin transform.
4. A method as claimed in claim 1, wherein the multimedia signal is an audio signal and said Fourier-Mellin transform includes a one-dimensional log mapping process being applied to the set of perceptual features.
5. A method as claimed in claim 1, wherein the multimedia signal is an image or video signal and said Fourier-Mellin transform includes a two-dimensional log-polar mapping process being applied to the set of perceptual features.
6. A method as claimed in claim 1, wherein the multimedia signal is an image or video signal and said Fourier-Mellin transform includes a two-dimensional log-log mapping process being applied to the set of perceptual features.
7. A method as claimed in claim 1, wherein said extracting step includes normalization of the set of perceptual features.
8. An apparatus for extracting a fingerprint from a multimedia signal, comprising: means (12,13) for extracting a set of robust perceptual features from the multimedia signal; means (15) for subjecting the extracted set of features to a Fourier-Mellin transform; means (16,19) for converting the transformed set of features into a sequence constituting the fingerprint.